Executive Summary

SwiftKey is a mobile keyboard application that predicts the next word as you type a message. When the user types "I went to the ", the application presents three options for what the next word might be: for example, gym, store, or restaurant.

In this project, we use R text mining tools to build a Shiny application that predicts the next word the user enters. We use data scraped from blogs, Twitter, and news sites to build a language model, and we predict the next word by calculating the maximum likelihood estimate (MLE) under that model.

TL;DR

In a nutshell, here’s a summary of the data analysis performed in this report.

  1. Load the raw data.
  2. Extract a 1% subsample of the data.
  3. Preprocess the data: remove stopwords, convert to lowercase, and perform related cleanup tasks.
  4. Generate 2-grams, 3-grams and 4-grams.
  5. Build an n-gram language model using a suffix tree.
  6. Predict the next word by calculating the maximum likelihood estimate (MLE) for all suffixes for a search string.

Complete Dataset

The Data Science Capstone Course provided text data from three data sources: blogs, Twitter, and news. The table below shows the number of lines per data source.

filename            line_count
en_US.blogs.txt        899,288
en_US.news.txt       1,010,242
en_US.twitter.txt    2,360,148
Total                4,269,678

Processing Unstructured Text for Analysis

We use a text mining framework within R to transform the unstructured text into the structured document-by-term matrix format required for analysis. The first step is to “clean” the text with a series of text processing functions: we convert to lowercase, remove stopwords, and perform related cleanup tasks.
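A minimal base-R sketch of these cleaning steps (the actual pipeline uses a text-mining framework's transformers; the function name and the tiny stopword list here are our own, for illustration):

```r
# Sketch of the cleaning steps: lowercase, strip punctuation and numbers,
# drop stopwords, collapse whitespace. The stopword list is a toy subset.
clean_text <- function(x, stopwords = c("i", "the", "to", "a")) {
  x <- tolower(x)
  x <- gsub("[[:punct:][:digit:]]+", " ", x)  # drop punctuation and numbers
  words <- unlist(strsplit(x, "\\s+"))
  words <- words[nzchar(words) & !(words %in% stopwords)]
  paste(words, collapse = " ")
}

clean_text("I went to the GYM, twice!")
# "went gym twice"
```

In practice each step is applied to the whole corpus rather than to single strings.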

Second, we tokenize the text into ngrams. For our analysis, a gram is a whitespace-delimited sequence of characters corresponding to an English word, and the n in ngram is the number of consecutive words treated as a single unit. For example, new is a 1-gram, new york a 2-gram, new york city a 3-gram, and new york city police a 4-gram.
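The tokenization step can be sketched in base R as a sliding window over the word sequence (the function name is ours, not the report's):

```r
# Generate the ngrams of order n from a whitespace-delimited string.
make_ngrams <- function(text, n) {
  words <- unlist(strsplit(tolower(text), "\\s+"))
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

make_ngrams("New York City police", 2)
# "new york"  "york city"  "city police"
```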

Finally, we build a document-by-term matrix by transforming the ngrams into a bag-of-words model. Each column in the matrix represents a unique ngram; each row represents a document, and each cell holds the frequency with which that ngram appears in the document.
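A base-R sketch of this bag-of-words transformation, using two toy documents of our own (the actual analysis builds the matrix with the text-mining framework):

```r
# Rows = documents, columns = unique ngrams, cells = ngram counts per document.
docs <- list(d1 = c("new york", "york city"),
             d2 = c("new york", "new york"))
terms <- sort(unique(unlist(docs)))
dtm <- t(sapply(docs, function(d) {
  counts <- table(factor(d, levels = terms))  # count every term, including zeros
  as.numeric(counts)
}))
colnames(dtm) <- terms
dtm
```

Here `dtm["d2", "new york"]` is 2 and `dtm["d1", "york city"]` is 1, matching the counts in the toy documents.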

Sampling the Dataset

The 4,269,678 lines from the complete dataset can be memory intensive for the text mining tools and slow the analysis. To speed things up, we subsample 1% of the complete dataset and then work with the subsampled data for exploration and modeling. The subsampling implementation is in Appendix 1.

source    num_lines   num_unique_words   median_word_freq   mean_word_freq
twitter       23602               8040                  9               20
blogs          8993              12414                  6               15
news          10103              12850                  7               15

We see that the mean word frequency is nearly twice the median frequency for all data sources. The word frequency distribution has a long tail, as seen in the plot below: 937 of 22,899 (4.1%) words cover 50% of all word instances in the dataset, so the mean is pulled upward by a few words that occur very frequently.
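The coverage figure can be computed by sorting the word frequencies and finding where the cumulative share first reaches 50%. A sketch with a toy frequency vector of our own (not the report's data):

```r
# How many of the most frequent words cover `target` of all word instances?
coverage_count <- function(freqs, target = 0.5) {
  sorted <- sort(freqs, decreasing = TRUE)
  which(cumsum(sorted) / sum(sorted) >= target)[1]
}

freqs <- c(the = 100, to = 60, a = 40, gym = 5, store = 3, llama = 1)
coverage_count(freqs)
# 2  (the two most frequent words already cover half of all instances)
```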

NGram Frequencies

Now we will look at the top bigrams and trigrams in the dataset.

Top Bigrams

The top bigrams include place names such as new york and common phrases such as happy birthday and last night.

Top Trigrams

The top trigrams include place names such as new york city and common phrases such as happy mothers day and rock n roll.

Language Model for Word Prediction

A suffix tree is a data structure in which an interior node represents a “root” character sequence from the training data and each child node represents a suffix of that root. In our implementation, each node represents a word and each arc (i.e., search path) represents a bigram, trigram, or 4-gram. Each node has an associated count that represents the ngram frequency in the training data. We use these counts to predict the next word by calculating the maximum likelihood estimate (MLE).
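A minimal sketch of such a count tree using nested R lists. The `perform_search` name mirrors the function used later in this report, but the internals and the tiny hand-built tree are our own illustration, with counts taken from the worked example below:

```r
# Each node holds the ngram count plus named children for the next word.
ngram_tree <- list(
  data = list(count = 198, children = list(
    entry = list(count = 12, children = list(
      just    = list(count = 6, children = list()),
      respond = list(count = 6, children = list())
    )),
    streams = list(count = 10, children = list())
  ))
)

# Walk the prefix down the tree, then rank each child word by
# MLE = child count / parent count.
perform_search <- function(tree, prefix) {
  node <- list(children = tree)
  for (w in prefix) {
    node <- node$children[[w]]
    if (is.null(node)) return(NULL)  # phrase not found in the tree
  }
  counts <- vapply(node$children, function(ch) ch$count, numeric(1))
  sort(counts / node$count, decreasing = TRUE)
}

perform_search(ngram_tree, c("data"))
# entry ≈ 0.0606, streams ≈ 0.0505
```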

Word Prediction for: ‘data’

For example, the figure below shows a suffix tree for ngrams that begin with the word data. The first level (w1) of the tree corresponds to the 1-gram “data”. The second level (w2) represents all bigrams whose root is “data”, such as “data entry” or “data streams”. The same applies to the third (w3) and fourth (w4) levels for trigrams and 4-grams respectively.

Each node contains the ngram frequency count from the training data. We calculate the MLE as P_mle(w2|w1) = C(w1 w2)/C(w1). The frequency count for “data” is 198. The frequency count for “entry” is 12 and for “streams” is 10. The probability of “data entry” is P_mle(entry|data) = 12/198 ≈ 0.061 (6.1%). The probability of “data streams” is P_mle(streams|data) = 10/198 ≈ 0.051 (5.1%). Therefore, the language model would predict that “entry” is the most likely next word.

results <- perform_search(ngram_tree, c("data"))
print(results)
##                   12                   10                  
## recommended_words "entry"              "streams"           
## likelihood        "0.0606060606060606" "0.0505050505050505"
##                   8                    7                   
## recommended_words "recovery"           "dating"            
## likelihood        "0.0404040404040404" "0.0353535353535354"
##                   7                   
## recommended_words "personalize"       
## likelihood        "0.0353535353535354"

Word Prediction for: ‘data entry’

Now let’s predict the next word after “data entry”.

We calculate the MLE as P_mle(w3|w1 w2) = C(w1 w2 w3)/C(w1 w2). The frequency count for “data entry” is 12. The frequency count for “just” is 6 and for “respond” is 6. The probability of “data entry just” is P_mle(just|data entry) = 6/12 = 0.50 = 50%. The probability of “data entry respond” is P_mle(respond|data entry) = 6/12 = 0.50 = 50%. The language model would predict that “just” and “respond” are equally likely to be the next word.

results <- perform_search(ngram_tree, c("data", "entry"))
print(results)
##                   6      6        
## recommended_words "just" "respond"
## likelihood        "0.5"  "0.5"

Conclusion

The preliminary results show that we can build an ngram language model from blog, news, and Twitter training data, and that a suffix tree built from those ngrams yields a working word predictor.

Next Steps

Here is the list of next steps for the coming weeks.

  • Use a weighting technique like TF/IDF to adjust the ngram frequencies.
  • Apply text preprocessing to search text.
  • Implement backoff search to handle instances when a phrase is not found in the tree.
  • Build a model using more than a 1% sample.
  • Deploy the ngram tree to the server side of a Shiny application.

Appendix

Appendix 1: Subsampling Code

This code collects a 1% sample using a “coin flip” to decide which lines to choose.

# Sample ~1% of a capstone data file with a per-line "coin flip"
sample_capstone_data <- function(fn, outfn, sample_len=0.01) {
  print(sprintf("Reading %s", fn))
  lines <- readLines(fn)
  print(sprintf("Read %s. Length %s", fn, length(lines)))
  set.seed(123)
  # Flip a biased coin (probability sample_len) for each line; keep the "heads"
  keep <- rbinom(length(lines), 1, sample_len) == 1
  lines_sample <- lines[keep]
  print(sprintf("Writing %s. Length %s", outfn, length(lines_sample)))
  writeLines(lines_sample, con=outfn)
}

sample_capstone_data("./data/final/en_US/en_US.twitter.txt",
                     "./data/final/en_US/sample/en_US.twitter.txt")
sample_capstone_data("./data/final/en_US/en_US.blogs.txt",
                     "./data/final/en_US/sample/en_US.blogs.txt")
sample_capstone_data("./data/final/en_US/en_US.news.txt",
                     "./data/final/en_US/sample/en_US.news.txt")